Missing values are a common problem in data science and machine learning. Removing instances with missing values can adversely affect the quality of further data analysis. This is exacerbated when there are relatively many more features than instances, so that the proportion of affected instances is high. Such a scenario is common in many important domains; for example, single nucleotide polymorphism (SNP) datasets provide a large number of features over a genome for a relatively small number of individuals. To preserve as much information as possible prior to modeling, a rigorous imputation scheme is acutely needed. While Denoising Autoencoders are a state-of-the-art method for imputation in high-dimensional data, they still require enough complete cases to be trained on, which are often not available in real-world problems. In this paper, we consider missing value imputation as a multi-label classification problem and propose Chains of Autoreplicative Random Forests. Using multi-label Random Forests instead of neural networks works well for low-sampled data, as there are fewer parameters to optimize. Experiments on several SNP datasets show that our algorithm effectively imputes missing values based only on information from the dataset and exhibits better performance than standard algorithms that do not require any additional information. In this paper the algorithm is implemented specifically for SNP data, but it can easily be adapted for other cases of missing value imputation.
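The chaining idea behind the method can be made concrete: treat each feature with missing entries as a prediction target, fill it in from the data observed so far, and make each imputed column available to later links of the chain. A minimal pure-Python sketch, with a trivial majority-vote "classifier" standing in for the paper's multi-label random forests (the names `impute_chain` and `majority_fit` are illustrative, not from the paper):

```python
from collections import Counter

def majority_fit(rows, target_idx):
    """Stand-in for a trained model: the most common observed value
    of the target column (None marks a missing entry)."""
    observed = [r[target_idx] for r in rows if r[target_idx] is not None]
    return Counter(observed).most_common(1)[0][0]

def impute_chain(rows, column_order):
    """Impute columns one by one; each filled-in column is part of the
    data seen by later links of the chain."""
    rows = [list(r) for r in rows]
    for idx in column_order:
        fill = majority_fit(rows, idx)
        for r in rows:
            if r[idx] is None:
                r[idx] = fill
    return rows

# Toy SNP-like data: genotypes coded 0/1/2, None = missing.
data = [[0, 1, None],
        [0, None, 2],
        [0, 1, 2],
        [None, 1, 2]]
filled = impute_chain(data, column_order=[0, 1, 2])
```

A real link would condition on the other columns (as a random forest does) rather than use an unconditional majority, but the chain structure is the same.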
The literature on machine learning in the context of data streams is vast and growing. However, many of the defining assumptions regarding data-stream learning tasks are too strong to hold in practice, or are even contradictory such that they cannot be met in the context of supervised learning. Algorithms are chosen and designed based on criteria which are often not clearly stated, for problem settings not clearly defined, tested in unrealistic settings, and/or in isolation from related approaches in the wider literature. This puts into question the potential for real-world impact of many approaches conceived in such contexts, and risks propagating a misguided research focus. We propose to tackle these issues by reformulating the fundamental definitions and settings of supervised data-stream learning with regard to contemporary considerations of concept drift and temporal dependence; we take a fresh look at what constitutes a supervised data-stream learning task, and reconsider the algorithms that may be applied to tackle such tasks. Through this reformulation and overview, aided by an informal survey of industrial players dealing with real-world data streams, we provide recommendations. Our main emphasis is that learning from data streams does not impose a single-pass or online-learning approach, or any particular learning regime, and that constraints on memory and time are not specific to streaming; meanwhile, established techniques for dealing with temporal dependence and concept drift exist in other areas of the literature. For the data streams community, we thus encourage a shift in research focus: from dealing with often-artificial constraints and assumptions on the learning mode, to issues such as robustness, privacy, and interpretability, which are increasingly relevant to learning in data streams in academic and industrial settings.
Decision trees are well known for their ease of interpretability. To gain accuracy, we need to grow deep trees or ensembles of trees. These are hard to interpret, offsetting their original benefit. Shapley values have recently become a popular way to explain the predictions of tree-based machine learning models. They provide a linear weighting to features that is independent of the tree structure. The rise in popularity is mainly attributed to TreeShap, which solves a general exponential-complexity problem in polynomial time. Following its widespread adoption in industry, a more efficient algorithm is needed. This paper presents a more efficient and straightforward algorithm: Linear TreeShap. Like TreeShap, Linear TreeShap is exact and requires the same amount of memory.
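The exponential-complexity baseline that TreeShap-style algorithms avoid is the direct Shapley definition: a weighted average of a feature's marginal contribution over all coalitions of the other features. A self-contained sketch of that exact (and deliberately naive) computation on a toy set function:

```python
from itertools import combinations
from math import factorial

def shapley_values(value, n):
    """Exact Shapley values of set function `value` over n players,
    by enumerating all 2^n coalitions -- the exponential cost that
    TreeShap-style algorithms sidestep for tree models."""
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(len(others) + 1):
            for S in combinations(others, k):
                # Shapley weight |S|! (n-|S|-1)! / n!
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[i] += w * (value(set(S) | {i}) - value(set(S)))
    return phi

# Toy additive "model": feature j contributes j + 1 to the output.
v = lambda S: sum(j + 1 for j in S)
phi = shapley_values(v, 3)  # additive game, so phi recovers [1.0, 2.0, 3.0]
```

For an additive game the Shapley values equal each player's own contribution, which makes the sketch easy to check; tree-structured models are exactly the case where this brute force can be replaced by polynomial-time algorithms.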
A multi-label classifier estimates the binary label state (relevant vs irrelevant) of each of a set of concept labels, for any given instance. Probabilistic multi-label classifiers provide a predictive posterior distribution over all possible labelset combinations of such label states (the powerset of labels), from which we can provide the best estimate by selecting the labelset corresponding to the largest expected accuracy over this distribution. For example, when maximizing exact match accuracy, we provide the mode of the distribution. But how does this relate to the confidence we may have in such estimates? Confidence is an important ingredient in real-world applications of multi-label classifiers (as in machine learning generally), and an important component of explainability and interpretability. However, it is not obvious how to provide confidence in the multi-label context in relation to a particular accuracy metric, nor is it clear how to provide a confidence that correlates well with expected accuracy, which would be most valuable in real-world decision making. In this paper, we treat expected accuracy as a surrogate for confidence under a given accuracy metric. We hypothesize that expected accuracy can be estimated from the multi-label predictive distribution. We examine seven candidate functions for their ability to estimate expected accuracy from the predictive distribution. We find that three of them correlate with expected accuracy and are robust. Further, we determine that each candidate is useful by itself for estimating Hamming similarity, but a combination of candidates is best for expected Jaccard index and exact match.
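The central quantity can be written down directly: given a posterior over labelsets and a chosen prediction, its expected accuracy under a metric is an average over the distribution. A small sketch of that computation (the paper's seven candidate estimators are not reproduced here; this only illustrates the "expected accuracy as confidence" idea on a toy posterior):

```python
def expected_accuracy(dist, y_hat, metric):
    """Expectation of metric(y, y_hat) under a predictive distribution
    `dist`, given as {labelset (tuple of 0/1): probability}."""
    return sum(p * metric(y, y_hat) for y, p in dist.items())

exact_match = lambda y, z: float(y == z)
hamming_sim = lambda y, z: sum(a == b for a, b in zip(y, z)) / len(y)

# Toy posterior over labelsets of 2 labels.
dist = {(1, 0): 0.5, (1, 1): 0.3, (0, 0): 0.2}
mode = max(dist, key=dist.get)  # maximizes expected exact match

conf_em = expected_accuracy(dist, mode, exact_match)   # 0.5
conf_hs = expected_accuracy(dist, mode, hamming_sim)   # 0.75
```

Note how the same prediction carries different confidences under different metrics, which is exactly why a confidence measure must be tied to a specific accuracy metric.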
Non-intrusive load monitoring (NILM) seeks to save energy by estimating the power usage of individual appliances from a single aggregate measurement. Deep neural networks have become increasingly popular for tackling the NILM problem. However, most models are used for load identification rather than online source separation. Among source separation models, most take a single-task learning approach, in which a neural network is trained exclusively for each appliance. This strategy is computationally expensive and ignores the fact that multiple appliances can be active simultaneously, as well as the dependencies between them. The remaining models are not causal, which matters for real-time applications. Inspired by Conv-TasNet, a model for speech separation, we propose Conv-NILM-Net, a fully convolutional framework for end-to-end NILM. Conv-NILM-Net is a causal model for multi-appliance source separation. Our model is evaluated on two real-world datasets, including UK-DALE, and clearly outperforms the state of the art while remaining significantly smaller than competing models.
In multi-label learning, a particular case of multi-task learning where a single data point is associated with multiple target labels, it has been widely assumed in the literature that, to obtain the best accuracy, the dependence among the labels should be explicitly modeled. This premise has led to a proliferation of methods for learning and predicting labels together, e.g., where the prediction of one label influences the predictions of other labels. Even though it is now acknowledged that in many contexts a model of dependence is not required for optimal performance, such models continue to outperform independent models in some of these very contexts, hinting at explanations for their performance other than label dependence, which the literature has only recently begun to unravel. Leveraging and extending recent findings, we turn the original premise of multi-label learning on its head and address the problem of joint modeling specifically in the absence of any measurable dependence among task labels; for example, when task labels come from separate problem domains. We carry the insights from this study into building a transfer learning approach that challenges the long-held assumption that the transferability of tasks comes from measurements of similarity between the source and target domains or models. This allows us to design and test a transfer learning method that is model-driven rather than purely data-driven, and that is furthermore black box and model agnostic (any base model class can be considered). We show that, in essence, we can create task dependence based on source model capacity. The results we obtain have important implications and provide clear directions for future work in the areas of multi-label and transfer learning.
Objective: Machine learning techniques have been used extensively for 12-lead electrocardiogram (ECG) analysis. For physiological time series, the superiority of deep learning (DL) over feature engineering (FE) approaches based on domain knowledge is still an open question. Moreover, it remains unclear whether combining DL with FE can improve performance. Methods: We considered three tasks intended to address these research gaps: cardiac arrhythmia diagnosis (multiclass-multilabel classification), atrial fibrillation risk prediction (binary classification), and age estimation (regression). We used an overall dataset of 2.3M 12-lead ECG recordings to train the following models for each task: i) a random forest taking FE as input, trained as a classical machine learning approach; ii) an end-to-end DL model; and iii) a merged model of FE+DL. Results: FE yielded results comparable to DL while requiring significantly less data for the two classification tasks, whereas DL outperformed FE for the regression task. For all tasks, merging FE with DL did not improve performance over DL alone. Conclusion: We found that for traditional 12-lead ECG diagnosis tasks, DL did not yield a meaningful improvement over FE, while it significantly improved the non-traditional regression task. We also found that combining FE with DL did not improve over DL alone, suggesting that the features learned by DL are redundant with the FE features. Significance: Our findings provide important recommendations on which machine learning strategy and data regime to choose for a given task when developing new machine learning models based on the 12-lead ECG.
In this work, we analyze noisy importance sampling (IS), i.e., IS working with noisy evaluations of the target density. We present the general framework and derive the optimal proposal densities for noisy IS estimators. The optimal proposals incorporate information about the variance of the noisy realizations, proposing points in regions where the noise power is higher. We also compare the use of the optimal proposals with previous optimality approaches considered in the noisy IS framework.
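The setting is easy to reproduce: a self-normalized IS estimator still converges when the target density can only be evaluated up to positive multiplicative noise, since the noise inflates numerator and denominator weights alike. A minimal sketch with an illustrative noise model and a hand-picked (not the paper's optimal) proposal:

```python
import math
import random

def noisy_is_mean(target_pdf, noise, prop_sample, prop_pdf, n, rng):
    """Self-normalized IS estimate of the target mean, where each
    evaluation of the (unnormalized) target density is corrupted by
    independent positive multiplicative noise."""
    num = den = 0.0
    for _ in range(n):
        x = prop_sample(rng)
        w = noise(rng) * target_pdf(x) / prop_pdf(x)  # noisy weight
        num += w * x
        den += w
    return num / den

# Unnormalized N(2, 1) target; wider N(0, 3) proposal covering it.
target = lambda x: math.exp(-0.5 * (x - 2.0) ** 2)
prop_pdf = lambda x: math.exp(-0.5 * (x / 3.0) ** 2) / (3.0 * math.sqrt(2 * math.pi))
prop_sample = lambda rng: rng.gauss(0.0, 3.0)
noise = lambda rng: rng.lognormvariate(0.0, 0.2)  # positive, multiplicative

rng = random.Random(0)
est = noisy_is_mean(target, noise, prop_sample, prop_pdf, 20000, rng)
# est approaches the target mean (2.0) despite the noisy evaluations
```

The paper's point is that the proposal itself should account for where the noise power is higher; the fixed Gaussian proposal above ignores this and simply demonstrates that the noisy estimator remains consistent.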
Variational autoencoders and Helmholtz machines use a recognition network (encoder) to approximate the posterior distribution of a generative model (decoder). In this paper we study the necessary and sufficient properties of a recognition network so that it can model the true posterior distribution exactly. These results are derived in the general context of probabilistic graphical modelling / Bayesian networks, for which the network represents a set of conditional independence statements. We derive both global conditions, in terms of d-separation, and local conditions for the recognition network to have the desired qualities. It turns out that for the local conditions, the property of perfectness (for every node, all parents are joined) plays an important role.
Scholarly text is often laden with jargon, or specialized language that divides disciplines. We extend past work that characterizes science at the level of word types, by using BERT-based word sense induction to find additional words that are widespread but overloaded with different uses across fields. We define scholarly jargon as discipline-specific word types and senses, and estimate its prevalence across hundreds of fields using interpretable, information-theoretic metrics. We demonstrate the utility of our approach for science of science and computational sociolinguistics by highlighting two key social implications. First, we measure audience design, and find that most fields reduce jargon when publishing in general-purpose journals, but some do so more than others. Second, though jargon has varying correlation with articles' citation rates within fields, it nearly always impedes interdisciplinary impact. Broadly, our measurements can inform ways in which language could be revised to serve as a bridge rather than a barrier in science.